Abstract

The Open-Source Chemistry Analysis Routines (OSCAR) software, a toolkit for the recognition of named entities and 
data in chemistry publications, has been developed since 2002. Recent work has resulted in the separation of the 
core OSCAR functionality and its release as the OSCAR4 library. This library features a modular API (based on 
reduction of surface coupling) that permits client programmers to easily incorporate it into external applications. 
OSCAR4 offers a domain-independent architecture upon which chemistry-specific text-mining tools can be built,
and its development and usage are discussed. 



Introduction 

In keeping with the historical and methodological aspects 
of this special issue, we recount the history and motiva- 
tion of OSCAR. 

A large amount of factual data in chemistry and 
neighbouring disciplines is published in the form of text 
and components within text rather than as structured 
semantic information. If we can discover and extract 
this information, the textual literature becomes an enor- 
mous additional chemical resource. As an example, we 
estimate that about 10 million chemical syntheses per 
year are published in the public literature (articles, 
patents, theses) and the conventional method is a nat- 
ural language narrative (most commonly in English). It 
is extremely tedious and error-prone to extract informa- 
tion from this narrative manually, and for this reason 
many chemical abstracting services limit their scope and 
also frequently lag behind the current publication list. 

The discipline of text-mining has now reached a state 
where much natural language in textual form can be 
analysed rapidly and with high precision and recall. 
Methodologies applied to the problem of chemical 
named entity recognition include dictionary- and rule- 
based methods, as well as machine learning and hybrid 
approaches [1-11]. We have been working in this area 
for approximately 10 years and the OSCAR4 software, 
together with OPSIN (the Open Parser for Systematic 
IUPAC Nomenclature) [12,13] and ChemicalTagger 
[14,15], represents the public state of the art in chemical
text analysis and extraction. 

The OSCAR (Open-Source Chemistry Analysis Routines)
software has been developed over a period of years and
a number of projects. Between 2002 and 2004, sponsors
including the Royal Society of Chemistry (RSC), Nature
and the International Union of Crystallography (IUCr)
supported a number of summer studentships. These
projects focused on the development of software with
limited capacity for the automated interpretation of
chemical documents, and resulted in two main software
components: the Experimental Data Checker [16,17] and
OSCAR2.

The Experimental Data Checker was conceived as a
tool to be used as part of the RSC's publication process.
The tool recognises sections of reported experimental
data within plain text input, using regular expressions
to match the highly stylised, journal-mandated formats
in which they are reported in the literature (as shown
in Figure 1). Once this information has been identified
and interpreted, the tool performs elementary checks on
the characterisation data (spectra, analytical) where
molecular structures are reported, and attempts to
ensure that the data do not conflict with the structure.
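
As an illustration of the regular-expression approach described above, the following minimal Java sketch matches one stylised pattern, a 1H NMR peak such as "1.34 (3H, d, J 6.5)", and sums the reported hydrogen counts so that they could be compared with a molecular formula. The class name, the pattern and the input string are assumptions made for this example; they are not taken from the Experimental Data Checker source.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch; not the Experimental Data Checker's own code. */
final class NmrPeakSketch {

    /** One reported peak: chemical shift (ppm), hydrogen count and multiplicity. */
    static final class Peak {
        final double shift;
        final int hydrogens;
        final String multiplicity;

        Peak(double shift, int hydrogens, String multiplicity) {
            this.shift = shift;
            this.hydrogens = hydrogens;
            this.multiplicity = multiplicity;
        }
    }

    // Assumed pattern: a shift or shift range, then "(nH, multiplicity",
    // e.g. "1.34 (3H, d", "7.19-7.23 (3H, m" or "1.86 (1H, br s".
    private static final Pattern PEAK = Pattern.compile(
            "(\\d+\\.\\d+)(?:-\\d+\\.\\d+)?\\s*\\((\\d+)H,\\s*([a-z]+(?: [a-z]+)?)");

    /** Returns every peak in the text that matches the assumed pattern. */
    static List<Peak> findPeaks(String text) {
        List<Peak> peaks = new ArrayList<>();
        Matcher m = PEAK.matcher(text);
        while (m.find()) {
            // For a shift range, only the first value is kept in this sketch.
            peaks.add(new Peak(Double.parseDouble(m.group(1)),
                    Integer.parseInt(m.group(2)), m.group(3)));
        }
        return peaks;
    }

    /** Sums the hydrogen counts, for comparison with a molecular formula. */
    static int totalHydrogens(List<Peak> peaks) {
        int total = 0;
        for (Peak p : peaks) {
            total += p.hydrogens;
        }
        return total;
    }

    public static void main(String[] args) {
        String data = "dH (400 MHz, CDCl3) 1.34 (3H, d, J 6.5), 1.86 (1H, br s, NH), "
                + "7.19-7.23 (3H, m)";
        List<Peak> peaks = findPeaks(data);
        System.out.println(peaks.size() + " peaks, " + totalHydrogens(peaks) + " H");
    }
}
```

A real checker would need many such patterns (13C shifts, mass spectra, elemental analysis) and would have to tolerate journal-specific variations; the sketch shows only the general shape of the technique.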

The Experimental Data Checker application relied 
upon a core library of analysis routines, and it was this 
library that was the first to bear the name OSCAR. 
Further development of this library in the summer of 
2004 resulted in OSCAR2, which used XML formatting 
to represent the document undergoing processing, and 

Figure 1 A screenshot of the Experimental Data Checker (OSCAR-Data) showing identification and markup of plain text experimental
data. The initial application of OSCAR was to parse the highly stylised data used to report spectra and other analytical proofs of synthesis. This
functionality is very widely used (pers. comm. from RSC staff) and has been re-integrated into OSCAR4 rather than being a separate application.



applied XML annotations to the document to indicate 
recognised sections of text. OSCAR2 implemented a 
naive Bayesian system based on n-grams and a simple 
grammar in order to identify chemical names within a 
text. These improvements were later extended as part of 
the OSCAR3 project. 

In 2005, the EPSRC awarded a grant ("Sciborg") to
develop natural language processing (NLP) tools for
chemistry and science. The chemistry component of
this project focused on the development of the
OSCAR2 methodology and resulted in the creation of
OSCAR3 [18]. OSCAR3 focuses on the recognition of
chemical named entities and, where appropriate, their
resolution to connection tables. OSCAR3 employs a
naive Bayesian model to identify "chemical" tokens in
text and offers a choice of two methods for the
identification of multi-token named entities. The first
of these, the PatternRecogniser, uses predetermined
regular-expression style heuristics, while the second,
the MEMMRecogniser [19], employs machine learning in
the form of a Maximum Entropy Markov Model (MEMM).
OSCAR3 uses these methods to identify four classes of
named entity (Chemical, Reaction, Chemical Adjective
and Enzyme). It also uses dictionary lookup to identify
a pre-determined set of ontology terms, and a discrete
finite automaton based method to identify chemical
prefixes.

In order to convert chemical names to connection 
tables (Figure 2), OSCAR3 uses dictionary-based meth- 
ods and, where this is not successful, OPSIN. Early ver- 
sions of OSCAR directly included the OPSIN code, but 
this was later re-factored into a separate library. 

By 2008, OSCAR was in common use in many labora- 
tories for the identification and extraction of chemical 
terms (chemical named entities) in a variety of texts. 
Our original metrics [18] showed that the precision and 
recall were domain-dependent and varied considerably 
with the purpose and style of chemical texts. Feedback 
from users was informal but it was clear that they were 
modifying OSCAR for their particular purposes both in 
vocabulary and recognition methods. As a result we 
embarked on a major re-factoring program in order to 

Figure 2 OSCAR3 markup displaying recognised chemical entities (CM). A mouse-over action on an annotated term displays the associated
metadata, in this case for 2,5-dichlorobenzylamine, and displays an image representing the structure generated by the Chemistry Development
Kit (CDK) [35-37] (right). OSCAR3 concentrated on the identification and interpretation of chemical entities in text (named entity recognition,
NER). The primary purpose was to identify and extract the following types of object: chemicals (CM), ontology terms (ONT; looked up from ChEBI
[38-40], FIX [41] and REX [42] etc.), reactions (RN; as identified by linguistic constructs, e.g. "methylated"), chemical adjectives (CJ; mainly formed
from chemical nouns), enzymes (ASE) and chemical prefixes (CPR), highlighted in different colours. These concepts are maintained in OSCAR4.



robustify the OSCAR software and simplify the API, and 
this paper describes the results. 

Historical Funding and Collaboration 

It is very difficult to get funding for software engineer- 
ing projects, especially when apparently little changes on 
the surface. We are grateful to the following bodies for 
their funding and interest: 

1. OMII-UK. This organisation existed to support and 
robustify the products of the UK eScience program. 
Many of these were middleware products but OSCAR 
was seen by the UK eScience community as an exam- 
ple of a widely-deployable component that could be 
used in a modern manner in many branches of 
science. The OMII-UK project carried out an initial 
scoping and re-factoring of the OSCAR3 source. 

2. The OSCAR-ChEBI project. This was a competi- 
tive funding resource for eScience products and we 
worked with the European Bioinformatics Institute 
(EBI) to develop OSCAR as an appropriate tool for 
the extraction and verification of chemistry in the 
ChEBI ontology. 



3. CheTA. This was a JISC-funded project led by our
group in conjunction with the National Centre for
Text Mining (NaCTeM) to evaluate the relative merits
of human annotation and machine annotation of
documents. Part of this project involved OSCAR
running under the UIMA [20]/U-Compare [21,22]
framework and required a re-factoring [23].

As a result of these projects, which probably 
amounted to two person-years of effort in the re-factor- 
ing, OSCAR4 has now been released in a usable form. 

Limitation of OSCAR3 and design goals for OSCAR4 

OSCAR3 is a powerful tool for chemical natural language
processing, but early attempts to develop software
using it as a library rather than as a standalone
application (the ChemicalTagger [14] and PatentEye
[24,25] projects) exposed weaknesses in the code in this
regard. The architecture of the software was built
around the principle that the software would be running
as a server on the user's local machine. In order to
function correctly, it required a properly configured
workspace. Many key components were implemented as
mutable singletons (static objects), which compromised
the thread-safety of the application and meant that safe
reconfiguration of a workflow required a complete
shutdown and restart of the Java virtual machine (JVM).
Furthermore, the implementations of the various OSCAR
components required that a document be formatted in
SciXML as it underwent processing. Consequently, the
use of OSCAR3 by a client programmer to build secondary
applications was unintuitive, and the distribution and
successful use of such applications was found, as part
of the Green Chain Reaction, to require an unacceptably
high level of support.

Early attempts to resolve these problems [23] involved 
the extraction of the OSCAR3 tokeniser, MEMMRecog- 
niser and PatternRecogniser components from the main 
OSCAR3 codebase and their conversion into modules 
suitable for use in the popular text-mining framework 
U-Compare. This work allowed the use of OSCAR as 
part of a drag-and-drop workflow, but not its direct 
integration into another application. Consequently, a 
comprehensive overhaul of the OSCAR3 code began in 
autumn 2010 with the aim of producing a well-engi- 
neered, simple, modularised version of OSCAR that 
retained the core OSCAR3 functionality and could be 
easily integrated into external applications. This most 
recent development has been designated OSCAR4 and 
is discussed in the remainder of this paper. 

The development of OSCAR4 sought to address a 
number of specific issues. These are summarised below 
(and in Appendix A) and subsequently discussed in 
greater detail. 

1. To produce an OSCAR library with a simple API,
suitable for use by client programmers who may not
be familiar with the internal workings of OSCAR.
Consequently, while it is desirable for users to be
able to customise the behaviour of OSCAR in a
number of ways, initialisation of OSCAR components
must by default produce configurations that
"just work", following the 'convention over
configuration' paradigm (Appendix B).

2. In order to run, OSCAR3 required the existence
of a properly configured workspace: a directory on
the executing machine that contains the OSCAR
chemical name dictionary, the InChI [26,27] binary
file and a properties file, along with subdirectories
intended to contain further resource files. When
OSCAR3 is first run this workspace is automatically
created, and when OSCAR3 is used as a library the
workspace is automatically created in the working
directory. This behaviour was deemed undesirable
and unnecessary, and was found to be a cause of
difficulties in producing distributable
OSCAR-dependent software. Consequently, the removal
of the requirement for a workspace was considered a
high priority of the OSCAR4 project.

3. Much of the OSCAR3 code required that a document
undergoing processing be formatted in SciXML.
Though converters are provided to transform HTML
into plain text and plain text into SciXML, the
requirement to perform this transformation is
frustrating to client programmers in that it prevents
them from working directly with plain text or with a
custom XML format, which may very well be the
native format of a document that they wish to
process. Consequently, the removal of this SciXML
dependence was considered important.

4. In addition to its core functionality (the
recognition and interpretation of chemical named
entities), OSCAR3 included a wide range of secondary
functions, including the OSCAR3 server. This server
runs on the local machine and provides an interactive
demonstration of the capacity of OSCAR3 for text
processing, as well as a number of other utilities:
the capacity to manually annotate a text from within
a browser window, a servlet for the interconversion
and depiction of chemical names and formats, and an
experimental Hearst pattern [28] based system for the
extraction of chemical relations from text. The
OSCAR3 codebase resembled a 'treasure trove', which
made code maintenance a more complex task than
necessary. The separation of a library containing the
core OSCAR functionality from these secondary
functions was therefore considered desirable.

5. Much of the architecture of OSCAR3 lacked clear
definition. Excessive use is made of mutable
singletons which, while aiding performance by
eliminating the need for re-initialisation of
components, allow for complex interactions in the
code, making it difficult to understand, debug and
re-factor. This problem was compounded by the manner
in which program logic is partially controlled by a
properties object backed by a serialised file. Some
of the property values can be modified at runtime
while others, once accessed by the objects that rely
upon them, are duplicated in memory and cannot be
further changed. Attempts to resolve these complex
interactions can have unintended consequences since
the unit test coverage in OSCAR3 is sparse.
Consequently, the improvement of the architecture of
the OSCAR software was considered a vital part of the
OSCAR4 project.

6. It has been known for some time that the speed
of OSCAR3 operation could be improved by introducing
certain optimisations into the code. Using the
YourKit Java profiler [29], a number of performance
blackspots were identified and subsequently
eliminated. This work was started after the final
version of OSCAR3 (OSCAR3 alpha 5 [30]) and continued
as part of the OSCAR4 project.

Library as a design 

OSCAR4 has been deliberately written as a Java library,
rather than an application or service. Consequently, the
core OSCAR functionality is decoupled from the
applications that use it. The usage of the library has
been simplified as much as possible with the
introduction of the Oscar API object, a class intended
to wrap the functionality of the wider library and
provide default implementations of the various
components. As a result, OSCAR4 can be called from
external software, as shown in the examples in Figure 3.

In the first of the examples in Figure 3, OSCAR4 is
used to detect named entities in an input string,
returning a List of NamedEntity objects. In the second,
it is used both to detect named entities and, where
these named entities correspond to chemical names, to
resolve these names to chemical structures, returning a
List of ResolvedNamedEntity objects. The
ResolvedNamedEntity class links a NamedEntity to a list
of chemical structures in a number of formats (SMILES,
InChI and CML), while the NamedEntity class stores such
information as the surface (raw text) and type (e.g.
compound or reaction) of the named entity and the
indices that define its position within the source text.
The outputs of these examples are illustrated in
Figure 4.

The examples above show how OSCAR4 can be used 
without the need for any understanding of the underly- 
ing technology or implementations. An overview of the 
workflow managed by the Oscar API object is shown in 
Figure 5. 

The input is first passed to the Tokeniser to produce a
list of TokenSequence objects, each of which roughly
corresponds to a paragraph of text and contains a list of
Token objects. A Token represents a string of characters
that usually corresponds to a word but may also be
punctuation or another discrete unit of text, e.g.
"C2H6O" or "42". In NLP tools, tokenisation commonly
occurs at whitespace or punctuation boundaries; however,
because of the form of some of the domain-specific
entities found in chemical texts, such as "C-H", a
custom Tokeniser is used. The TokenSequences are then
passed to a ChemicalEntityRecogniser (an interface for a
class capable of identifying a list of NamedEntities),
and the NamedEntities are subsequently passed to the
ChemNameDictRegistry to create a list of
ResolvedNamedEntities if required.
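
To make the tokenisation issue concrete, the sketch below shows one simple rule that keeps "C-H" and "1,2-dichloroethane" as single tokens while still splitting off sentence punctuation, and records the character offsets of each token. It is a deliberately reduced illustration: the class, the Token fields and the regular expression are assumptions for this example, and the OSCAR4 Tokeniser itself is considerably more elaborate.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of chemistry-aware tokenisation; not OSCAR4's Tokeniser. */
final class TokeniserSketch {

    /** A token with its surface text and [start, end) character offsets. */
    static final class Token {
        final String surface;
        final int start;
        final int end;

        Token(String surface, int start, int end) {
            this.surface = surface;
            this.start = start;
            this.end = end;
        }
    }

    // Assumed rule: punctuation (. , ; :) stays inside a token only when it is
    // followed by further non-space characters; otherwise it is its own token.
    // Hyphens never split a token, so "C-H" survives intact.
    private static final Pattern TOKEN =
            Pattern.compile("[^\\s.,;:]+(?:[.,;:][^\\s.,;:]+)*|[.,;:]");

    /** Splits the text into tokens, keeping source offsets for later annotation. */
    static List<Token> tokenise(String text) {
        List<Token> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(new Token(m.group(), m.start(), m.end()));
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (Token t : tokenise("The C-H bond in 1,2-dichloroethane is strong.")) {
            System.out.println(t.surface + " [" + t.start + ", " + t.end + ")");
        }
    }
}
```

Keeping the offsets on each token is what later allows a recognised entity to be mapped back to its position in the source text.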

This workflow can be customised by the user, who
can use the set() methods of the Oscar class to replace
the components of the default configuration with
suitable customised or custom-built alternatives.
Specifically, the user can select which implementation
of ChemicalEntityRecogniser to use, which set of
ontology terms is to be recognised, which model the
default ChemicalEntityRecogniser should use, and which
dictionary registry, i.e. set of chemical name
dictionaries, to use for name to structure resolution.
In addition to this, the public APIs of the individual
components can be used to assume a greater degree of
control over the execution of the workflow.

OSCAR4 provides three implementations of the
ChemicalEntityRecogniser. The first, the RegexRecogniser,
finds terms that match a given regular expression and is
intended to find serial numbers corresponding to
compounds, e.g. "NSC-2648". The others, the
PatternRecogniser and the MEMMRecogniser, use more
complex strategies to identify chemical named entities
and feature subcomponents that can be customised by the
user to produce the desired behaviour.
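
The idea of several recognisers sharing one interface can be sketched as follows. The Recogniser interface, the Entity class and the "CM" type label used here are stand-ins invented for this example (they are not the OSCAR4 ChemicalEntityRecogniser or NamedEntity types); the regular expression is likewise an assumed pattern for serial numbers of the "NSC-2648" kind.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of a regex-backed recogniser behind a common interface. */
final class RegexRecogniserSketch {

    /** Minimal stand-in for a named entity: surface, type and [start, end) offsets. */
    static final class Entity {
        final String surface;
        final String type;
        final int start;
        final int end;

        Entity(String surface, String type, int start, int end) {
            this.surface = surface;
            this.type = type;
            this.start = start;
            this.end = end;
        }
    }

    /** Stand-in interface: any strategy that finds entities in text. */
    interface Recogniser {
        List<Entity> findEntities(String text);
    }

    /** Builds a recogniser that reports every match of the given expression. */
    static Recogniser forPattern(String regex, String type) {
        Pattern pattern = Pattern.compile(regex);
        return text -> {
            List<Entity> found = new ArrayList<>();
            Matcher m = pattern.matcher(text);
            while (m.find()) {
                found.add(new Entity(m.group(), type, m.start(), m.end()));
            }
            return found;
        };
    }

    public static void main(String[] args) {
        // Assumed pattern for NSC-style compound serial numbers.
        Recogniser serials = forPattern("\\bNSC-\\d+\\b", "CM");
        for (Entity e : serials.findEntities("NSC-2648 was screened alongside NSC-17.")) {
            System.out.println(e.surface + " [" + e.start + ", " + e.end + ")");
        }
    }
}
```

Because callers depend only on the interface, a statistical recogniser could be substituted for the regex-based one without changing the calling code, which is the property the customisable workflow relies on.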

The architecture of the PatternRecogniser is shown 
in Figure 6. A list of "chemical" words is drawn from 
an internal dictionary composed mostly of words 



a) String text = "The quick brown ethyl acetate jumps over the lazy bromine";
   Oscar oscar = new Oscar();
   List<NamedEntity> neList = oscar.findNamedEntities(text);

b) String text = "The quick brown ethyl acetate jumps over the lazy bromine";
   Oscar oscar = new Oscar();
   List<ResolvedNamedEntity> rneList = oscar.findAndResolveNamedEntities(text);

Figure 3 Java code using the OSCAR4 API to a) identify chemical named entities (CNEs) in a block of text and b) identify CNEs and
resolve their connection tables where possible.

[Figure content: for the input "The quick brown ethyl acetate jumps over the lazy bromine", two named entities are shown: "ethyl acetate" (type CM, startPos 16, endPos 29, confidence 0.980) and "bromine" (type CM, startPos 50, endPos 57, confidence 0.990), each linked to resolved structures in SMILES, InChI and CML formats from the ChEBI and OPSIN dictionaries.]

Figure 4 Graphic representing the structure of the OSCAR4 API output object. Named entities reference their position in the input text,
the confidence in their identification and resolved structures in various formats (SMILES [43,44], InChI, CML [45] etc.).



derived from the ChEBI database and from a corpus of
manually annotated documents, while a list of
"non-chemical" words is determined by removing those
words that occur in the chemical word list from a
standard English dictionary. These lists are used to
build an n-gram model which is used by a naive Bayesian
classifier to determine whether novel tokens are
"chemical" or "non-chemical". Multi-token named
entities, e.g. "ethyl acetate", that occur within the
input text are then identified by regex-style matching
of chemical tokens to a set of pre-specified pattern
definitions such as "*yl *ate".
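
The classification step described above can be sketched with a small character n-gram naive Bayes model. Everything specific in this example is an assumption made for illustration: the trigram length, the add-one smoothing, the equal class priors and the tiny word lists are not taken from OSCAR4, whose model is trained on ChEBI-derived and dictionary-derived vocabularies.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of an n-gram naive Bayes "chemical word" classifier. */
final class NGramBayesSketch {
    private static final int N = 3; // trigram length, an assumption of this sketch

    private final Map<String, Integer> chemCounts = new HashMap<>();
    private final Map<String, Integer> otherCounts = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int chemTotal;
    private int otherTotal;

    NGramBayesSketch(List<String> chemicalWords, List<String> otherWords) {
        for (String word : chemicalWords) {
            for (String gram : ngrams(word)) {
                chemCounts.merge(gram, 1, Integer::sum);
                vocabulary.add(gram);
                chemTotal++;
            }
        }
        for (String word : otherWords) {
            for (String gram : ngrams(word)) {
                otherCounts.merge(gram, 1, Integer::sum);
                vocabulary.add(gram);
                otherTotal++;
            }
        }
    }

    /** Character n-grams of the lower-cased word, padded with ^ and $ markers. */
    static List<String> ngrams(String word) {
        String padded = "^" + word.toLowerCase() + "$";
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + N <= padded.length(); i++) {
            grams.add(padded.substring(i, i + N));
        }
        return grams;
    }

    /** Log-odds of chemical vs non-chemical, with add-one smoothing and equal priors. */
    double logOdds(String word) {
        int v = vocabulary.size();
        double score = 0.0;
        for (String gram : ngrams(word)) {
            double pChem = (chemCounts.getOrDefault(gram, 0) + 1.0) / (chemTotal + v);
            double pOther = (otherCounts.getOrDefault(gram, 0) + 1.0) / (otherTotal + v);
            score += Math.log(pChem / pOther);
        }
        return score;
    }

    boolean isChemical(String word) {
        return logOdds(word) > 0.0;
    }

    public static void main(String[] args) {
        NGramBayesSketch model = new NGramBayesSketch(
                List.of("ethyl", "methyl", "propyl", "butyl",
                        "acetate", "benzoate", "chloride", "bromide"),
                List.of("quick", "brown", "jumps", "over",
                        "lazy", "table", "window", "garden"));
        // Neither test word appears in the training lists.
        System.out.println("pentyl: " + model.isChemical("pentyl"));
        System.out.println("winter: " + model.isChemical("winter"));
    }
}
```

The point of working at the n-gram level is that a previously unseen token such as "pentyl" can still be scored, because it shares fragments like "yl" at the end of the word with known chemical vocabulary.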



The architecture of the MEMMRecogniser, which identifies
chemical named entities using a Maximum Entropy Markov
Model (MEMM), is shown in Figure 7. The feature set that
is generated for each token includes features that
describe the token itself, such as its n-grams and the
probability that it is chemical as predicted by the
n-gram model described above, as well as contextual
features that describe its neighbouring tokens. Using
these features, the MEMM labels each chemical token as
either the first token in a named entity or a subsequent
token in a named entity. Given these assignments,
multi-token named entities can be constructed. Novel
MEMM models can be built by the user from a corpus of
hand-annotated documents, and OSCAR4 is supplied with
two pre-generated models. One of these models was built
from a set of papers from RSC journals [31], while the
other was built from a set of abstracts retrieved from
PubMed [19].
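
The final step, constructing multi-token entities from per-token assignments, can be sketched as below. The labels are supplied by hand here rather than predicted by a model, and the OUTSIDE label for non-chemical tokens, the enum and the class name are assumptions introduced for this example; the sketch covers only the decoding, not the MEMM itself.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: grouping first/subsequent token labels into entities. */
final class LabelDecoderSketch {

    /** FIRST and SUBSEQUENT follow the text; OUTSIDE is this sketch's own label. */
    enum Label { FIRST, SUBSEQUENT, OUTSIDE }

    /** Joins each FIRST token with the SUBSEQUENT tokens that directly follow it. */
    static List<String> decode(List<String> tokens, List<Label> labels) {
        if (tokens.size() != labels.size()) {
            throw new IllegalArgumentException("tokens and labels differ in length");
        }
        List<String> entities = new ArrayList<>();
        StringBuilder current = null;
        for (int i = 0; i < tokens.size(); i++) {
            switch (labels.get(i)) {
                case FIRST:
                    // A new entity starts; close any entity still open.
                    if (current != null) {
                        entities.add(current.toString());
                    }
                    current = new StringBuilder(tokens.get(i));
                    break;
                case SUBSEQUENT:
                    // Extends the open entity; ignored if no entity is open.
                    if (current != null) {
                        current.append(' ').append(tokens.get(i));
                    }
                    break;
                default:
                    if (current != null) {
                        entities.add(current.toString());
                        current = null;
                    }
            }
        }
        if (current != null) {
            entities.add(current.toString());
        }
        return entities;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("The", "quick", "brown", "ethyl", "acetate",
                "jumps", "over", "the", "lazy", "bromine");
        List<Label> labels = List.of(Label.OUTSIDE, Label.OUTSIDE, Label.OUTSIDE,
                Label.FIRST, Label.SUBSEQUENT, Label.OUTSIDE, Label.OUTSIDE,
                Label.OUTSIDE, Label.OUTSIDE, Label.FIRST);
        System.out.println(decode(tokens, labels));
    }
}
```

Distinguishing the first token from subsequent ones is what lets two adjacent entities be separated, since a second FIRST label closes the entity that precedes it.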



Architecture and tests 

The OSCAR4 library has been separated into a number 
of modules with each performing a defined role in the 
operation of the OSCAR code, such as the tokenisation 
of text or the provision of chemical name dictionaries. 
This allows client programmers to use as much or as 

Figure 6 PatternRecogniser architecture.

Figure 7 MEMMRecogniser architecture.



little of OSCAR in their applications as required,
without the need to pull in a large, comprehensive,
single JAR. The process of creating the sub-projects had
the additional advantage of highlighting the ways in
which the separate components interact. During this
process, the readability of the OSCAR code was improved
by imposing a number of the idioms of 'clean code', and
the reliability of the code was improved by the creation
of appropriate unit and regression tests. At the time of
writing, OSCAR4 has nearly 500 tests. As a result, the
OSCAR4 code is far more robust than OSCAR3, so a
developer can work both with and on the core OSCAR code
with a far greater degree of confidence.

The mutable singletons that were commonplace in
OSCAR3 have been largely removed. Instead, when setting
up custom workflows, a user has the choice of
either calling the getDefaultInstance() method or the


Figure 8 OSCAR4 run within Bioclipse's scripting interface (centre pane) identifying named entities in a block of text and saving the
connection tables to file (extractedMols.sdf) for viewing (right pane).

Table 1 Results of the initialisation task

Software version   Recogniser          Mean time (s)   Standard deviation (ms)
OSCAR4 4.0.1       MEMMRecogniser      14.4            40.8
OSCAR3 alpha 5     MEMMRecogniser      17.3            40.0
OSCAR4 4.0.1       PatternRecogniser   19.7            72.6
OSCAR3 alpha 5     PatternRecogniser   24.6            88.6



default constructor as appropriate-each of which returns 
a preconfigured instance of the class-or using the cus- 
tom constructor which uses dependency injection to 
supply the OSCAR components upon which the class 
depends. For example, the OntologyTerms class repre- 
sents a set of ontology terms and their corresponding 
ontology IDs. The following two methods of obtaining 
an OntologyTerms object are available: 

OntologyTerms.getDefaultInstance();

new OntologyTerms(ListMultimap<String, String> terms);

The first method returns the default OSCAR4
OntologyTerms object, which contains an amalgamation of the
terms from the ChEBI, FIX and REX ontologies, while
the second builds an OntologyTerms object from a
caller-supplied multimap of ontology terms to IDs. The use
of this design pattern throughout the codebase
permits, but by no means requires, a user to
assume a high degree of control over the functioning of
OSCAR.
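
The pattern described above can be illustrated with a minimal,
self-contained sketch. The class below is a hypothetical stand-in
written for this illustration only; it is not OSCAR4 code, it uses a
plain java.util.Map in place of a ListMultimap, and the default
term and ID are made-up placeholders.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a class such as OntologyTerms (not OSCAR4 code).
final class TermSetSketch {

    // Created on first request and never modified afterwards.
    private static TermSetSketch defaultInstance;

    private final Map<String, List<String>> termsToIds;

    // Custom constructor: the caller injects the term-to-ID mapping.
    TermSetSketch(Map<String, List<String>> termsToIds) {
        Map<String, List<String>> copy = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : termsToIds.entrySet()) {
            // Defensive copy, so later changes by the caller have no effect.
            copy.put(e.getKey(),
                     Collections.unmodifiableList(new ArrayList<>(e.getValue())));
        }
        this.termsToIds = Collections.unmodifiableMap(copy);
    }

    // Preconfigured instance for callers that need no customisation.
    static synchronized TermSetSketch getDefaultInstance() {
        if (defaultInstance == null) {
            Map<String, List<String>> defaults = new LinkedHashMap<>();
            // Placeholder entry; a real default would be loaded from resources.
            defaults.put("example term", Collections.singletonList("EX:0000001"));
            defaultInstance = new TermSetSketch(defaults);
        }
        return defaultInstance;
    }

    boolean contains(String term) {
        return termsToIds.containsKey(term);
    }

    List<String> idsFor(String term) {
        List<String> ids = termsToIds.get(term);
        return ids == null ? Collections.<String>emptyList() : ids;
    }
}
```

Because the instance is immutable once constructed, the shared
default can be handed to many callers, while a caller with special
requirements simply constructs its own instance.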

The use of the properties file and object to control
elements of the program execution has been removed.
Instead, the required information is specified either as
part of a constructor's signature or using a set() method
on the object in question. This improves the thread-safety
of OSCAR, particularly in a multiuser environment,
and contributes to its usability, since a user can
now see what features may be customised from
the outline of the class rather than needing to know
which properties are used by which components, and how.
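
A minimal sketch of per-object configuration, as opposed to a
shared properties object, is given below. The class name and the
confidence threshold setting are hypothetical and chosen only for
illustration; they are not taken from the OSCAR4 API.

```java
// Hypothetical component configured through its constructor and a setter,
// rather than through a global properties object (not OSCAR4 code).
final class ConfiguredRecogniserSketch {

    // Each instance owns its setting; volatile makes updates visible
    // to other threads that share this particular instance.
    private volatile double confidenceThreshold;

    ConfiguredRecogniserSketch(double confidenceThreshold) {
        setConfidenceThreshold(confidenceThreshold);
    }

    void setConfidenceThreshold(double threshold) {
        if (threshold < 0.0 || threshold > 1.0) {
            throw new IllegalArgumentException("threshold must lie in [0, 1]");
        }
        this.confidenceThreshold = threshold;
    }

    double getConfidenceThreshold() {
        return confidenceThreshold;
    }

    // Accepts a candidate whose confidence reaches this instance's threshold.
    boolean accepts(double confidence) {
        return confidence >= confidenceThreshold;
    }
}
```

Two users of the same process can each hold their own instance
with different settings, and changing one instance does not affect
the other, which is not the case when settings live in a single
shared properties object.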

Input and Output Formats 

As previously discussed, OSCAR3 required that input
documents be converted into SciXML before processing
could occur, using the document formatting as a base
against which annotations for identified named entities
could be referenced, whether as inline or standoff
annotations. XML input turned out to be overly complex, as
NLP tools require "flat", relatively sequential tokens and the
XML markup adds little useful context. OSCAR4
removes this requirement by operating on plain text and
producing NamedEntity and DataAnnotation objects to
represent recognised sections of text. It does not currently
produce serialised output, though some support
for the serialisation of annotations into XML documents
is planned for future releases. It should be realised, however,
that there is no single, fool-proof approach to this
problem. Different XML schemas may use different
methods to indicate where in the document section
breaks and even text content occur, and it cannot be
guaranteed that well-formed inline annotations can be
generated for a given input document. Client programmers
are therefore recommended to consume NamedEntity
objects directly rather than rely upon serialised
output, though it is recognised that users are likely to want
to be able to create serialised, marked-up copies of their
documents as well.
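
The difficulty with inline serialisation can be illustrated with a
small sketch. The Span class, the "ne" element name and the
toInline method below are assumptions made for this example and
are not part of OSCAR4. Standoff spans over plain text are turned
into inline markup, which is only possible when the spans do not
overlap.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration of standoff versus inline annotation (not OSCAR4 code).
final class StandoffSketch {

    // A standoff annotation: character offsets into the text plus a label.
    static final class Span {
        final int start;
        final int end;   // exclusive
        final String type;

        Span(int start, int end, String type) {
            this.start = start;
            this.end = end;
            this.type = type;
        }
    }

    // Escapes the characters that would otherwise break the markup.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    // Writes the spans inline; fails if any two spans overlap.
    static String toInline(String text, List<Span> spans) {
        List<Span> sorted = new ArrayList<>(spans);
        sorted.sort(Comparator.comparingInt((Span s) -> s.start));
        StringBuilder out = new StringBuilder();
        int cursor = 0;
        for (Span s : sorted) {
            if (s.start < cursor) {
                throw new IllegalArgumentException(
                        "span starting at " + s.start + " overlaps an earlier span");
            }
            if (s.end <= s.start || s.end > text.length()) {
                throw new IllegalArgumentException("span out of range");
            }
            out.append(escape(text.substring(cursor, s.start)));
            out.append("<ne type=\"").append(escape(s.type)).append("\">");
            out.append(escape(text.substring(s.start, s.end)));
            out.append("</ne>");
            cursor = s.end;
        }
        out.append(escape(text.substring(cursor)));
        return out.toString();
    }
}
```

A client that consumes the spans directly never encounters the
overlap failure, which is one reason to prefer the object form over
a serialised copy.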

Non-core functionality 

Non-critical code (particularly downstream applications) 
has been removed from the OSCAR4 codebase to reflect 
the philosophy that OSCAR4 should act as a library. 
While some minor supporting code remains, such as 
that required for generation of key resource files, the 
majority has been removed entirely as it is envisaged 
that much of the former functionality could be better 
implemented by developers with specific use cases. 

A number of useful non-core functions are provided 
in dependent libraries developed at the Unilever Centre 
in Cambridge. Specifically, subsidiary modules exist to 
provide the capacity to run OSCAR4 from the com- 
mand-line, as part of UIMA or Taverna [32] workflows 
and from the Bioclipse [33] scripting interface, as shown 
in Figure 8. 

Performance 

Table 2 Results of the bulk processing task

Software version        OSCAR4 4.0.1    OSCAR3 alpha 5  OSCAR4 4.0.1       OSCAR3 alpha 5
Recogniser              MEMMRecogniser  MEMMRecogniser  PatternRecogniser  PatternRecogniser
Mean time (s)           446             541             150                276
Standard deviation (s)  1.85            1.14            0.556              1.53

A number of modifications were introduced to the
OSCAR code with the aim of reducing the time
required to process documents. Performance hotspots
were identified using the YourKit Java profiler and 
where possible eliminated. Some such improvements 
focused on the time taken to initialise the various 
OSCAR components, such as supplying a pre-calculated, 
serialised copy of the n-gram models used for named 
entity recognition rather than regenerating them each 
time OSCAR is loaded. Others improved the speed at 
which OSCAR can process a document by optimising 
extremely tight loops in the code, such as eliminating 
unnecessary string declaration while calculating n-gram 
features and avoiding recompilation of regular expres- 
sions. Further improvements were made ad hoc, as the 
OSCAR4 developers encountered obvious bottlenecks 
while working on the code. 
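
As an illustration of one optimisation of this kind, the sketch
below contrasts recompiling a regular expression on every call
with compiling it once and reusing it. The class, the pattern and
the method names are chosen for this example and are not taken
from the OSCAR4 source.

```java
import java.util.regex.Pattern;

// Illustrative only: reusing a compiled regular expression in a tight loop.
final class RegexReuseSketch {

    // Compiled once, when the class is initialised, and reused thereafter.
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    // Slower variant: String.matches compiles the expression on every call.
    static int countNumericRecompiling(String[] tokens) {
        int count = 0;
        for (String token : tokens) {
            if (token.matches("\\d+")) {
                count++;
            }
        }
        return count;
    }

    // Faster variant: only a new Matcher is created for each token.
    static int countNumericPrecompiled(String[] tokens) {
        int count = 0;
        for (String token : tokens) {
            if (DIGITS.matcher(token).matches()) {
                count++;
            }
        }
        return count;
    }
}
```

Both methods return the same result; the saving comes solely from
avoiding repeated compilation, and it grows with the number of
tokens processed.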

In order to quantify the improvement in speed of 
operation, the time taken by both OSCAR4 version 4.0.1 
and OSCAR3 alpha 5 to perform two tasks was mea- 
sured. The first task measured the time taken to initia- 
lise the software to the point that it was ready to begin 
the task of finding named entities in text; the second 
task aimed to measure the speed at which the software 
could process bulk text and consisted of processing the 
full text of the 68 patents published by the European 
Patent Office in the week of 2009-05-06, a total of 11468
paragraphs of text. All the tasks were run on a desktop
computer equipped with an Intel Pentium 4 (3.00 GHz) 
CPU and 1 GB of RAM, purchased c. 2005, running 
openSUSE 11.1 and using the Java 1.6.0_22 32-bit virtual 
machine with a maximum heap size of 512 MB. The 
results are summarised in Table 1 and Table 2. 

From these data, it can be seen that OSCAR4 per- 
forms significantly faster than OSCAR3. Initialisation 
times for the MEMMRecogniser and PatternRecogniser 
have been reduced by 17% and 20% respectively, while 
bulk processing times have been reduced by 18% and 
46% respectively. The OSCAR4 MEMMRecogniser and 
PatternRecogniser processed approximately 26 and 76 
paragraphs per second respectively, demonstrating that 
bulk processing of text is achievable on an acceptable 
timescale on desktop computers. 
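
The percentages and throughput figures quoted above follow
directly from the mean times in Table 1 and Table 2 and from the
paragraph count of 11468. The short sketch below reproduces the
arithmetic; the class and method names are chosen for this
example only.

```java
// Reproduces the arithmetic behind the quoted speed-ups from the table values.
final class TimingSummarySketch {

    // Percentage by which the new time is shorter than the old time.
    static double percentReduction(double oldSeconds, double newSeconds) {
        return 100.0 * (oldSeconds - newSeconds) / oldSeconds;
    }

    // Average number of paragraphs processed per second.
    static double paragraphsPerSecond(int paragraphs, double seconds) {
        return paragraphs / seconds;
    }

    public static void main(String[] args) {
        int paragraphs = 11468;
        System.out.printf("Initialisation, MEMM:    %.1f%%%n", percentReduction(17.3, 14.4));
        System.out.printf("Initialisation, Pattern: %.1f%%%n", percentReduction(24.6, 19.7));
        System.out.printf("Bulk, MEMM:              %.1f%%%n", percentReduction(541, 446));
        System.out.printf("Bulk, Pattern:           %.1f%%%n", percentReduction(276, 150));
        System.out.printf("MEMM throughput:         %.1f paragraphs/s%n",
                paragraphsPerSecond(paragraphs, 446));
        System.out.printf("Pattern throughput:      %.1f paragraphs/s%n",
                paragraphsPerSecond(paragraphs, 150));
    }
}
```

Rounded to the nearest whole number, these calculations agree
with the values given in the text.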

Deployment 

OSCAR4 has generated significant interest in the com- 
munity, and has been the subject of two meetings at the 
Unilever Centre for Molecular Science Informatics in 
Cambridge. The talks from the second of these are 
available to view online [34]. To our knowledge, the 
software is in use at the National Centre for Text 
Mining (NaCTeM), the European Bioinformatics Insti- 
tute (EBI) and the European Patent Office (EPO) as well 
as various pharmaceutical companies. 

We are aware of successful and straightforward
integrations into the Bioclipse and Taverna frameworks, and
believe that integration is similarly straightforward for other
Java environments. We were also pleased to see that at
the recent MIOSS meeting at the EBI, OSCAR and 
OPSIN had been integrated into the .NET environment. 
For example, OPSIN was demonstrated as running 
within the JVM in Microsoft Excel, which is acceptable 
to commercial organisations as the JVM is of proven 
security. 

Future Prospects 

This is a useful opportunity to reflect on the high cost of 
producing robust, re-usable software. OSCAR3, and 
OPSIN, were produced as a continuing activity by a mix- 
ture of summer students, PhDs and PDRAs and, until ca. 
2009, evolved rather than having a top-down software 
design. When the project became valuable to the world, it 
was a clear indication that re-factoring was going to be 
essential, and it is important to realise the necessary but 
high cost of doing this. In times of lean funding, it will 
become increasingly difficult to obtain this type of support, 
and therefore it is always tempting to transfer academic 
code to commercial entities which can raise revenue. 

The downside of this is that we know of very few
commercial codes, and certainly none in chemical text
analysis, that provide public metrics, let alone expose the
architecture on which the program is based. Text-mining
as an academic subject requires metrics and,
increasingly, Openness of the components of
the system, which we have provided in OSCAR and OPSIN.
We are investigating business models that would allow us
to continue to re-factor and improve the product without
closing the code and thereby reducing scientific
credibility and innovation.

Very recently we have been exploring the use of OSCAR 
for areas other than organic and biological chemistry. 
Because OSCAR can be customised by different diction- 
aries, we have been able to adapt it to process reports of 
atmospheric chemistry and, more generally, atmospheric 
science. In conjunction with the European Geosciences 
Union (EGU, which publishes Open Access papers), we 
have analysed abstracts and full text for chemical entities 
and related numerical quantities (e.g. amounts, conditions
etc.). This has led to a design where the domain-independent
parts of OSCAR4 can be applied to many physical
sciences with bespoke dictionaries and ChemicalTagger 
rules. We have submitted grants in both the biosciences 
("OSCAR-BIO") and physical sciences ("OSCAR-PHYS"). 
As part of this work, we will be actively addressing generic 
tools for metrics and training. 

Appendixes 

Appendix A: Additional OSCAR4 resources 

The source code, mailing list, tutorials, documentation 
and support are available at 
https://bitbucket.org/wwmm/oscar4/wiki/Home
This page also includes instructions for accessing pre- 
compiled JAR files from the Unilever Centre's Maven 
repository. 

The source code used to measure OSCAR performance
is available at https://bitbucket.org/dmj30/oscar-performance

The OSCAR4 Javadoc is available at
http://apidoc.ch.cam.ac.uk/oscar4-4.0.1

Appendix B: Building on the OSCAR4 API

The core methods are given in each case. In some cases 
it will be valuable to extract further information recur- 
sively from the results. 

a) Searching a given text for Named Entities. These 
can then be displayed, computed etc. 

Oscar oscar = new Oscar();
List<NamedEntity> namedEntities = oscar.findNamedEntities(text);

b) Where the named entity can be resolved to a che- 
mical structure, extract it: 

Oscar oscar = new Oscar();
// s holds the input text
List<ResolvedNamedEntity> entities = oscar.findAndResolveNamedEntities(s);
for (ResolvedNamedEntity entity : entities) {
    ChemicalStructure structure = entity.getFirstChemicalStructure(FormatType.INCHI);
}

c) Find only those entities which are resolvable to
structures (e.g. "benzene" but not "the methyl ester"):

Oscar oscar = new Oscar();
List<ResolvedNamedEntity> entities = oscar.findResolvableEntities(s);

d) Tailor the system to use different recognisers and
dictionaries:

ChemicalEntityRecogniser myRecogniser = new PatternRecogniser();
Oscar oscar = new Oscar();
oscar.setRecogniser(myRecogniser);
// myDictionaryRegistry is a dictionary registry configured beforehand
oscar.setDictionaryRegistry(myDictionaryRegistry);
List<ResolvedNamedEntity> entities = oscar.findResolvableEntities(s);